I am thrilled to dig into the world of open science and learn new tricks with the fascinating open source materials.
Link to my GitHub repository is here!
“Two things are infinite: the universe and human stupidity; and I’m not sure about the universe.” - Albert Einstein
This exercise focuses on data wrangling using an original dataset called “JYTOPKYS2” by Kimmo Vehkalahti (more information here). We create a subset of the dataset and analyse it with basic methods as well as with regression analysis, which is this week’s topic.
The dataset is called “learning2014”, and it consists of 166 observations
in 7 columns, describing each participant’s learning habits.
Here are the first six rows and the structure of the dataset.
head(learning2014)
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
Column names describe the following:
1. gender - Gender of the person
2. age - Age of the person
3. attitude - Global attitude toward statistics
4. deep - Combination of questions regarding “deep learning”: seeking meaning, relating ideas and using evidence in one’s studies (reference)
5. stra - Combination of questions regarding “strategic learning”: organised studying and time management (reference)
6. surf - Combination of questions regarding “surface learning”: lack of purpose, unrelated memorising and syllabus-boundness (reference)
7. points - Points from the statistics exam
From the structure we can see that six columns are numeric or integer, and one (gender) is a factor with levels “M” and “F”.
The following code blocks show summaries of each variable: the gender distribution and descriptive statistics for the numeric columns. There are almost twice as many female participants as male participants (110 vs 56), the mean age is 25.5 years, the average attitude is around 3.1 (scale 1-5), and the averages for deep, strategic and surface learning fall between 2.8 and 3.7. The average exam score is 22.7 points, with a minimum of 7 and a maximum of 33.
summary(learning2014$gender)
## F M
## 110 56
summary(learning2014$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 21.00 22.00 25.51 27.00 55.00
summary(learning2014$attitude)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.400 2.600 3.200 3.143 3.700 5.000
summary(learning2014$deep)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.583 3.333 3.667 3.680 4.083 4.917
summary(learning2014$stra)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.250 2.625 3.188 3.121 3.625 5.000
summary(learning2014$surf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.583 2.417 2.833 2.787 3.167 4.333
summary(learning2014$points)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 19.00 23.00 22.72 27.75 33.00
In order to see the relationships between these variables, we will combine scatter plots of all variable pairs into a scatter plot matrix.
# This matrix plot requires the ggplot2 and GGally libraries
library(ggplot2)
library(GGally)
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3),
        lower = list(combo = wrap("facethist", bins = 20)))
By eyeballing the plot, we can see that the strongest positive correlation is between ‘points’ and ‘attitude’, and the strongest negative correlation is between ‘surf’ and ‘deep’ (which makes sense, as they measure roughly opposite approaches to learning).
The plots also show that exam results are distributed quite evenly between female and male students, but that female participants tend to report higher surface-learning values, while male participants generally report a better attitude.
In this part, a regression analysis of the dataset will be conducted. The analysis describes the relationship between the dependent variable (in this case, points) and the chosen explanatory variables, and seeks to determine whether the explanatory variables explain the dependent variable. The results show whether the relationships are statistically significant, i.e. whether the explanatory variables truly explain the dependent variable.
y ~ x1 + x2 + x3
In the regression model, exam points (points) is the target variable (y) and the three variables chosen above are the explanatory variables. In R, a regression model is created with the lm() function:
model <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
According to the summary table, only attitude is a statistically significant explanatory variable for the dependent variable.
The residuals (distances from the regression line) range from -17.15 to 10.89 with a median of 0.52, and the residual standard error is 5.296.
The ‘Estimate’ column shows the effect of each variable on points.
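To make the estimates concrete, the fitted line can be applied by hand. A minimal sketch, using the coefficients copied from the model summary above and a hypothetical student whose values I chose for illustration (rounded coefficients, so the result is approximate):

```r
# Coefficient estimates copied from the model summary above
b0         <- 11.0171   # intercept
b_attitude <- 3.3952
b_stra     <- 0.8531
b_surf     <- -0.5861

# A hypothetical student (illustrative values)
attitude <- 3.7; stra <- 3.375; surf <- 2.583333

# Predicted exam score: intercept plus each variable times its estimate
predicted_points <- b0 + b_attitude * attitude + b_stra * stra + b_surf * surf
round(predicted_points, 1)  # approximately 24.9
```

Each one-unit increase in attitude thus adds about 3.4 predicted points, holding the other variables fixed.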
Since there is only one statistically significant variable, the regression model will be fitted again only with that variable:
model2 <- lm(points ~ attitude, data = learning2014)
summary(model2)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The p-value has decreased for both the intercept and the attitude variable, suggesting stronger statistical significance.
Note that the R-squared values are fairly low (about 0.19), but a low R-squared does not necessarily mean the model is useless: as we can see, the attitude effect is clearly statistically significant. It is said that any field that attempts to explain human behaviour typically has R-squared values below 50 %.
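For reference, R-squared is simply one minus the ratio of the residual sum of squares to the total sum of squares. A self-contained sketch on simulated data (the variable names are illustrative, not from the course dataset):

```r
set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)   # simulated linear relationship with noise
fit <- lm(y ~ x)

# R-squared by definition: 1 - RSS/TSS
rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)
r2_manual <- 1 - rss / tss

# Matches the value reported by summary()
all.equal(r2_manual, summary(fit)$r.squared)  # TRUE
```

So R-squared measures the share of variance in y that the model accounts for, not the significance of the coefficients.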
Diagnostic plots are made to validate the regression model’s assumptions. In R this is easy, using the basic plot() function with the model as the first argument. Assumptions are always part of a model, and how well the model describes the phenomenon of interest depends on how well those assumptions fit reality.
We will create a Normal Q-Q plot, a Residuals vs Fitted plot and a Residuals vs Leverage plot.
plot(model2, 2)  # Normal Q-Q
plot(model2, 1)  # Residuals vs Fitted
plot(model2, 5)  # Residuals vs Leverage
According to the regression analysis and model validation, we can conclude that the model is valid for studying the relationship between exam points and students’ attitudes. Lastly, here is a plot visualising the linear relationship between these two variables.
# initialize plot with data and aesthetic mapping
p1 <- ggplot(learning2014, aes(x = attitude, y = points, col = gender)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Student's attitude versus exam points")
# Draw the plot
p1
Data for this week’s exercise is retrieved from the UCI Machine Learning Repository; it was collected by Paulo Cortez (University of Minho, Portugal) in 2008. The data consists of two joined data frames about students’ achievements in secondary education in two Portuguese schools: one for students taking the math course and one for those taking the Portuguese course. For this exercise, the two datasets have been joined together (code for the data wrangling operations is here).
Glimpse of the data:
| school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | nursery | internet | guardian | traveltime | studytime | failures | schoolsup | famsup | paid | activities | higher | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 | alc_use | high_use |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | course | yes | no | mother | 2 | 2 | 0 | yes | no | no | no | yes | no | 4 | 3 | 4 | 1 | 1 | 3 | 5 | 2 | 8 | 8 | 1.0 | FALSE |
| GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | course | no | yes | father | 1 | 2 | 0 | no | yes | no | no | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 3 | 7 | 8 | 8 | 1.0 | FALSE |
| GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | other | yes | yes | mother | 1 | 2 | 2 | yes | no | yes | no | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 8 | 10 | 10 | 11 | 2.5 | TRUE |
| GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | home | yes | yes | mother | 1 | 3 | 0 | no | yes | yes | yes | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 1 | 14 | 14 | 14 | 1.0 | FALSE |
Column names for the dataset are:
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
Structure of the data is the following:
Observations: 382
Variables: 35
summary(alc)
## school sex age address famsize Pstatus
## GP:342 F:198 Min. :15.00 R: 81 GT3:278 A: 38
## MS: 40 M:184 1st Qu.:16.00 U:301 LE3:104 T:344
## Median :17.00
## Mean :16.59
## 3rd Qu.:17.00
## Max. :22.00
## Medu Fedu Mjob Fjob
## Min. :0.000 Min. :0.000 at_home : 53 at_home : 16
## 1st Qu.:2.000 1st Qu.:2.000 health : 33 health : 17
## Median :3.000 Median :3.000 other :138 other :211
## Mean :2.806 Mean :2.565 services: 96 services:107
## 3rd Qu.:4.000 3rd Qu.:4.000 teacher : 62 teacher : 31
## Max. :4.000 Max. :4.000
## reason nursery internet guardian traveltime
## course :140 no : 72 no : 58 father: 91 Min. :1.000
## home :110 yes:310 yes:324 mother:275 1st Qu.:1.000
## other : 34 other : 16 Median :1.000
## reputation: 98 Mean :1.448
## 3rd Qu.:2.000
## Max. :4.000
## studytime failures schoolsup famsup paid activities
## Min. :1.000 Min. :0.0000 no :331 no :144 no :205 no :181
## 1st Qu.:1.000 1st Qu.:0.0000 yes: 51 yes:238 yes:177 yes:201
## Median :2.000 Median :0.0000
## Mean :2.037 Mean :0.2016
## 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :3.0000
## higher romantic famrel freetime goout
## no : 18 no :261 Min. :1.000 Min. :1.00 Min. :1.000
## yes:364 yes:121 1st Qu.:4.000 1st Qu.:3.00 1st Qu.:2.000
## Median :4.000 Median :3.00 Median :3.000
## Mean :3.937 Mean :3.22 Mean :3.113
## 3rd Qu.:5.000 3rd Qu.:4.00 3rd Qu.:4.000
## Max. :5.000 Max. :5.00 Max. :5.000
## Dalc Walc health absences
## Min. :1.000 Min. :1.000 Min. :1.000 Min. : 0.0
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 1.0
## Median :1.000 Median :2.000 Median :4.000 Median : 3.0
## Mean :1.482 Mean :2.296 Mean :3.573 Mean : 4.5
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.: 6.0
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :45.0
## G1 G2 G3 alc_use
## Min. : 2.00 Min. : 4.00 Min. : 0.00 Min. :1.000
## 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:1.000
## Median :12.00 Median :12.00 Median :12.00 Median :1.500
## Mean :11.49 Mean :11.47 Mean :11.46 Mean :1.889
## 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:2.500
## Max. :18.00 Max. :18.00 Max. :18.00 Max. :5.000
## high_use
## Mode :logical
## FALSE:268
## TRUE :114
## NA's :0
##
##
Information about the attributes:
- school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
- sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
- age - student’s age (numeric: from 15 to 22)
- address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
- famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
- Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
- Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
- Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
- reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
- guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
- traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
- studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
- failures - number of past class failures (numeric: n if 1<=n<3, else 4)
- schoolsup - extra educational support (binary: yes or no)
- famsup - family educational support (binary: yes or no)
- paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
- activities - extra-curricular activities (binary: yes or no)
- nursery - attended nursery school (binary: yes or no)
- higher - wants to take higher education (binary: yes or no)
- internet - Internet access at home (binary: yes or no)
- romantic - with a romantic relationship (binary: yes or no)
- famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
- freetime - free time after school (numeric: from 1 - very low to 5 - very high)
- goout - going out with friends (numeric: from 1 - very low to 5 - very high)
- Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
- Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
- health - current health status (numeric: from 1 - very bad to 5 - very good)
- absences - number of school absences (numeric: from 0 to 93)
These grades are related to the course subject, Math or Portuguese:
- G1 - first period grade (numeric: from 0 to 20)
- G2 - second period grade (numeric: from 0 to 20)
- G3 - final grade (numeric: from 0 to 20, output target)
These variables were created when combining the two data frames and describe alcohol consumption:
- alc_use - average alcohol use on weekdays and weekends combined
- high_use - high alcohol use (alc_use > 2), logical (TRUE/FALSE)
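The two derived columns can be reproduced directly from their definitions. A minimal sketch on a toy data frame (standing in for the joined student data):

```r
# Toy data frame with the two weekly-consumption columns
toy <- data.frame(Dalc = c(1, 2, 5), Walc = c(1, 3, 5))

# alc_use: average of workday and weekend consumption
toy$alc_use <- (toy$Dalc + toy$Walc) / 2
# high_use: TRUE when the average consumption exceeds 2
toy$high_use <- toy$alc_use > 2

toy$alc_use   # 1.0 2.5 5.0
toy$high_use  # FALSE TRUE TRUE
```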
The purpose of this analysis is to study the relationships between high and low alcohol consumption and four other chosen variables, and to show the distributions between the variables both numerically and graphically.
Logistic regression fits a linear model for the log-odds of the binary outcome “success”, whose expected value is the probability of success.
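Concretely, the model turns a linear predictor into a probability through the logistic (inverse logit) function, which in R is plogis(). A tiny illustration with made-up linear-predictor values:

```r
# The logistic function maps any linear predictor to a probability in (0, 1)
eta <- c(-2, 0, 2)   # example linear-predictor values (illustrative)
plogis(eta)          # 0.1192029 0.5000000 0.8807971

# A linear predictor of 0 corresponds to a 50 % probability of "success"
```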
The dataset is quite interesting, because it provides multiple socially interesting factors to compare with each other. The four variables I chose to compare with alcohol consumption are health, G3 (total grades), studytime and absences. Research generally concludes that men exceed women in high-volume drinking, but it also suggests a clear female excess in the risk of becoming an underage drinker. That is why I am including the sex variable in my analysis as well.
- health: Alcohol affects human health, so logically high consumption would affect health status, and students who consume less alcohol would be healthier. I believe there may be a connection, although other factors, such as eating habits and sports, likely matter more.
- G3 (grades): I believe there may be some connection, in that students who consume more alcohol tend to get lower grades, but the difference is probably not large.
- studytime: I expect some correlation between alcohol consumption and study time.
- absences: I believe high alcohol consumption leads to more school absences.
Comparison of predictive variables with the target variable (high_use):
# First plot: Distributions of high alcohol consumption and health status with gender distinction
g1 <- ggplot(alc, aes(x = high_use, y = health, col=sex)) + geom_boxplot() + ggtitle("Health status vs. alcohol consumption")
# Second plot: Distributions of high alcohol consumption and G3(grades) with gender distinction
g2 <- ggplot(alc, aes(x = high_use, y = G3, col = sex)) + geom_boxplot() + ylab("grade") + ggtitle("Grades vs. alcohol consumption")
# Third plot: Distributions of high alcohol consumption and study time with gender distinction
g3 <- ggplot(alc, aes(x = high_use, y = studytime, col = sex)) + geom_boxplot() + ylab("Study time") + ggtitle("Study time vs. alcohol consumption")
# Fourth plot: Distributions of high alcohol consumption and school absences with gender distinction
g4 <- ggplot(alc, aes(x = high_use, y = absences, col = sex)) + geom_boxplot() + ylab("absence") + ggtitle("Absences vs. alcohol consumption")
# Drawing all four plots in a 2-column grid (adding ggplot layers to
# multiplot's return value would fail, so the plots are built beforehand)
multiplot(g1, g2, g3, g4, cols = 2)
Summary statistics:
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_health = mean(health), mean_studytime = mean(studytime), mean_grade = mean(G3), mean_absence = mean(absences))
## # A tibble: 4 x 7
## # Groups: sex [?]
## sex high_use count mean_health mean_studytime mean_grade mean_absence
## <fct> <lgl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 F FALSE 156 3.38 2.34 11.4 4.22
## 2 F TRUE 42 3.40 2 11.7 6.79
## 3 M FALSE 112 3.71 1.88 12.2 2.98
## 4 M TRUE 72 3.88 1.64 10.3 6.12
As the data was grouped by gender, we can see results for both male and female high and low alcohol consumers.
In plot 1 (upper left), health status is generally the same for both male and female students, with a drop in the median within low-alcohol-consuming female students, which is slightly surprising. Male students have exactly the same range of answers and the same median value of 4, whereas the answers of high-alcohol-consuming female students are more dispersed. The summary table shows that mean health status is quite similar for both sexes, but high-alcohol-consuming students have rated their health slightly better than low consumers.
In plot 2 (upper right), study times differ more between the sexes than between the consumption groups. My hypothesis of a correlation between alcohol consumption and study time is therefore not strongly supported, but the statistics show that boys study less overall, and that both boys and girls who consume more alcohol study slightly less on average.
In plot 3 (lower left), we can see that grades are fairly evenly distributed among low-alcohol consumers. Among high-alcohol consumers, girls generally get slightly better grades than boys, and only boys show clear low-grade outliers in both consumption groups. So, in line with my hypothesis, alcohol consumption appears to affect total grades, but it is hardly the only explanatory factor.
In plot 4 (lower right), we can see that girls in both groups have more outlier individuals with many absences. As for the statistics, high-alcohol-consuming students have more absences on average in both sex groups.
First we will create a logistic regression model and study its summary and coefficients.
m <- glm(high_use ~ sex + health + studytime + G3 + absences , data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ sex + health + studytime + G3 + absences,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1871 -0.8325 -0.6075 1.0677 2.2106
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.42587 0.64922 -0.656 0.511845
## sexM 0.81236 0.25409 3.197 0.001388 **
## health 0.03196 0.08731 0.366 0.714366
## studytime -0.35579 0.16124 -2.207 0.027344 *
## G3 -0.06155 0.03665 -1.679 0.093089 .
## absences 0.08642 0.02299 3.759 0.000171 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 420.17 on 376 degrees of freedom
## AIC: 432.17
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) sexM health studytime G3 absences
## -0.42586809 0.81236413 0.03195670 -0.35578595 -0.06155365 0.08642112
We can see that, ordered by statistical significance, the explanatory variables for high alcohol consumption are absences, sex and studytime.
Health is clearly insignificant and the p-value for grades (G3) is greater than 0.05, so I will exclude both from the next model and fit it again.
Next, let’s refit the model and look at its coefficients as odds ratios:
m <- glm(high_use ~ sex + studytime + absences , data = alc, family = "binomial")
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# Computing confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# Printing out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.3993889 0.1760458 0.8977721
## sexM 2.2216751 1.3647760 3.6532630
## studytime 0.6686620 0.4847669 0.9067963
## absences 1.0928234 1.0467369 1.1459745
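Odds ratios act multiplicatively on the odds: each one-unit increase in a predictor multiplies the odds of high consumption by its OR. A small sketch using the absences estimate from the table above (taking the point estimate at face value):

```r
or_absences <- 1.0928234   # odds ratio for 'absences' from the table above

# Odds ratios compound multiplicatively over unit increases:
# five extra absences multiply the odds of high use by roughly 1.56
or_absences^5              # ~1.56

# The corresponding log-odds coefficient is just the log of the OR
log(or_absences)           # ~0.089
```

Since the 95 % confidence interval for absences (1.05 to 1.15) excludes 1, the effect is statistically distinguishable from “no effect”.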
In this part the predictive power of the model will be explored.
# Fitting the model with statistically significant variables
m <- glm(high_use ~ sex + absences + studytime, data = alc, family = "binomial")
# Predict() the probability of high_use
probabilities <- predict(m, type = "response")
# Add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# Use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probabilities > 0.5)
# Tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>% prop.table %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67539267 0.02617801 0.70157068
## TRUE 0.23036649 0.06806283 0.29842932
## Sum 0.90575916 0.09424084 1.00000000
ggplot(alc, aes(x = probability, y = high_use, col=prediction)) + geom_point()
# Defining a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# Calling loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, alc$probability)
## [1] 0.2565445
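The training error above is just the sum of the off-diagonal cells of the proportion table, and it is worth comparing against a trivial baseline that always predicts FALSE, whose error equals the share of high users. A sketch using the numbers reported above:

```r
# Off-diagonal cells of the cross-tabulation above
false_positives <- 0.02617801   # predicted TRUE,  actually FALSE
false_negatives <- 0.23036649   # predicted FALSE, actually TRUE
false_positives + false_negatives   # 0.2565445, same as loss_func

# Baseline: always guessing FALSE misclassifies every high user
baseline_error <- 0.29842932
# so the model improves on the baseline by about 4 percentage points
```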
EXTRA:
When conducting a 10-fold cross-validation to the model, we get following results:
library(boot)
# K-fold cross validation, K = 10 because of the 10-fold CV
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2617801
This week’s exercise introduces classification and clustering as visual tools to explore statistical data.
The data used is a readily available R dataset in the MASS library called Boston. It contains data about housing values in Boston, USA. More information about the data can be found here.
crim = per capita crime rate by town.
zn = proportion of residential land zoned for lots over 25,000 sq.ft.
indus = proportion of non-retail business acres per town.
chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox = nitrogen oxides concentration (parts per 10 million).
rm = average number of rooms per dwelling.
age = proportion of owner-occupied units built prior to 1940.
dis = weighted mean of distances to five Boston employment centres.
rad = index of accessibility to radial highways.
tax = full-value property-tax rate per $10,000.
ptratio = pupil-teacher ratio by town.
black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat = lower status of the population (percent).
medv = median value of owner-occupied homes in $1000s.
data(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
As we can see the data consists of 506 observations and 14 variables.
In the next graph we can see the correlations between the variables, with statistical significance marked at the 0.05 level.
cor_matrix <- cor(Boston) %>% round(2)
res1 <- cor.mtest(Boston, conf.level = .95)
corrplot(cor_matrix, p.mat = res1$p, method = "color", type = "upper",
sig.level = c(.001, .01, .05), pch.cex = .9,
insig = "label_sig", pch.col = "white", order = "AOE")
As we can see, the darker the red, the stronger the positive correlation between two variables, and the darker the blue, the stronger the negative correlation. Almost all correlations are significant at the 0.05 level. Among the strongest, according to the matrix, are those between the distance to employment centres and both age and nitrogen oxides concentration, and between lower status (%) and the median value of owner-occupied homes, to name a few.
At this point the data will be scaled for later use. Scaling is necessary for the later linear discriminant analysis, because it assumes that the variables are normally distributed and share the same variance.
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Scaling standardised the variables, and as we can see, the min and max values have changed. With default settings, the scale() function computes the mean and standard deviation of each column and then “scales” each element by subtracting the column mean and dividing by the standard deviation. As a result, variables such as ‘age’ now contain negative values.
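The standardisation step is easy to verify by hand on a small vector; scale() with default arguments is equivalent to subtracting the mean and dividing by the standard deviation:

```r
x <- c(2, 4, 6)   # toy vector (mean 4, sd 2)

# Manual standardization
manual <- (x - mean(x)) / sd(x)

# scale() with defaults does the same (it returns a one-column matrix)
all.equal(as.numeric(scale(x)), manual)   # TRUE
manual                                    # -1 0 1
```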
Next I will create a categorical variable from the scaled crime rate variable. Summary of the crime variable:
# First converting the matrix into a dataframe
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419400 -0.410600 -0.390300 0.000000 0.007389 9.924000
The break points will be the quartiles. After creating the variable, I will remove the old ‘crim’ variable and replace it with the new categorical ‘crime’, which assigns each observation to one of the four quantile groups.
# Creating a quantile vector, and creating a categorical variable 'crime'
bins <- quantile(boston_scaled$crim)
# Creating categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))
# Table of the new crime variable
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# Removing original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# Adding the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Splitting the data into training and testing sets lets us check how well the model performs: the model is trained on the training set (80 % of the data) and then evaluated by predicting on the test set. This shows, for example, how well the model classifies new observations into groups.
Next I will divide the dataset into train and test sets; each step is commented in the code block.
# Choosing 80 % of the data as the training set based on the number of rows in the dataset
ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind,]
# Testing set is the data minus the training set
test <- boston_scaled[-ind,]
# Saving the actual correct classes into a new variable
correct_classes <- test$crime
# Removing the crime variable from test data
test <- dplyr::select(test, -crime)
LDA is used to predict classes for new data and to find the variables that best discriminate or separate the classes (DataCamp). The difference between classification and clustering is that in classification the classes are known: the model is trained with a labeled training set and then classifies new observations into those classes. In clustering, on the other hand, the classes are unknown, and the data is grouped based on the similarities between observations. If the assumptions of discriminant analysis are met, it is more powerful than logistic regression, but the assumptions are rarely met.
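To make the classification-versus-clustering distinction concrete, here is a small sketch on the built-in iris data (used here simply so the snippet runs without the Boston objects): LDA is trained on known labels, while k-means groups the same observations without ever seeing them.

```r
library(MASS)
set.seed(1)
# Classification: the class labels are known and used to train the model.
lda_fit <- lda(Species ~ ., data = iris)
lda_classes <- predict(lda_fit)$class
# Clustering: the labels are not used; groups emerge from similarity alone.
km_fit <- kmeans(iris[, 1:4], centers = 3)
# Cross-tabulating shows how the unsupervised clusters line up with the
# (unused) true species labels.
table(known = iris$Species, cluster = km_fit$cluster)
```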
Linear discriminant analysis for training data where crime is the target and all other variables as predictors:
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2524752 0.2524752 0.2351485 0.2599010
##
## Group means:
## zn indus chas nox rm
## low 0.9125722 -0.9387150 -0.11793298 -0.8673105 0.4736813
## med_low -0.1093329 -0.3488093 0.03646311 -0.5688043 -0.1690836
## med_high -0.3771136 0.1873251 0.10065938 0.4162256 0.1130320
## high -0.4872402 1.0170492 -0.15984049 1.0326176 -0.4020019
## age dis rad tax ptratio
## low -0.8856068 0.8641002 -0.6834950 -0.7579752 -0.4531402
## med_low -0.3458643 0.3104087 -0.5472552 -0.4732297 -0.1080697
## med_high 0.3799665 -0.3934712 -0.3713703 -0.2805897 -0.3572507
## high 0.8077246 -0.8400902 1.6388211 1.5145512 0.7815834
## black lstat medv
## low 0.38002361 -0.79485416 0.557351222
## med_low 0.32402858 -0.11904702 0.006026785
## med_high 0.08848333 -0.02028448 0.164792604
## high -0.83977741 0.83487436 -0.703580334
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.09209741 0.70815743 -0.93258782
## indus 0.06436257 -0.42509973 -0.07768853
## chas -0.10524684 -0.03206491 0.18614819
## nox 0.35361675 -0.66730372 -1.30335811
## rm -0.08297439 -0.07536678 -0.27900932
## age 0.28570244 -0.26842679 -0.14145173
## dis -0.05357907 -0.21937735 0.04385065
## rad 2.97577488 1.01266314 -0.40547138
## tax 0.01657693 -0.09476859 1.00152467
## ptratio 0.07894599 0.08215838 -0.22638286
## black -0.12801408 -0.02286222 0.08739739
## lstat 0.21013320 -0.16474776 0.41952427
## medv 0.14909642 -0.37622721 -0.14332686
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9450 0.0388 0.0162
The LDA function calculates the group means for each variable, the coefficients of the linear discriminants and the proportions of trace (the proportion of between-class variance explained by each successive discriminant function). Here there are three linear discriminants, since the crime variable has four classes (the number of discriminants is at most the number of classes minus one).
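The proportion of trace can also be recovered from the fitted object: lda() stores the singular values in its svd component, and their squares, normalized to sum to one, reproduce the printed proportions. A minimal sketch on the built-in iris data (so it runs without the Boston objects):

```r
library(MASS)
# The squared singular values of the LDA fit, normalized to sum to one,
# give the "Proportion of trace" line of the printed summary.
fit <- lda(Species ~ ., data = iris)
prop_trace <- fit$svd^2 / sum(fit$svd^2)
round(prop_trace, 4)  # LD1 ≈ 0.9912, LD2 ≈ 0.0088
```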
# Creating a biplot to visualize the LDA
# Creating an arrow function (DataCamp)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# Converting the target classes to numeric
classes <- as.numeric(train$crime)
# Plotting the LDA results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
The next phase is to use the fitted LDA to predict classes for the testing data. Since the correct values are stored in the correct_classes variable, I will cross-tabulate the predicted values against the correct values to see whether the classifier classified the observations correctly.
# Predicting the classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# Cross-tabulating the correct classes and the predicted classes (lda.pred$class)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 15 10 0 0
## med_low 6 13 5 0
## med_high 1 14 16 0
## high 0 0 0 22
The model is best at classifying high values; the lower the value, the less accurate the prediction.
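The overall accuracy can be read off the cross-tabulation: correct predictions lie on the diagonal. A sketch using the counts printed above (hard-coded here so the snippet is self-contained; an actual run of the random split would give slightly different counts):

```r
# Confusion matrix copied from the table above; the diagonal holds the
# correctly classified test observations.
conf <- matrix(c(15, 10,  0,  0,
                  6, 13,  5,  0,
                  1, 14, 16,  0,
                  0,  0,  0, 22),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 2)  # 0.65: about two thirds of the test set classified correctly
```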
In order to find clusters in the dataset, I will reload the Boston dataset and standardize it. Then I will calculate the distances between the observations and use k-means clustering. The optimal number of clusters is assessed with the total within-cluster sum of squares (WCSS): the within-cluster sum of squares is calculated for every cluster and the values are summed. A good number of clusters is reached at the point where the total WCSS drops radically.
Comments are in the codeblock.
# Loading the Boston dataset
data(Boston)
# Scaling the data to get comparable distances
scaled <- scale(Boston)
# Calculating the Euclidean distances for the dataset
# (note: dist() is applied here to the unscaled Boston data; dist(scaled) would use the standardized values)
dist_eu <- dist(Boston)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.119 85.620 170.500 226.300 371.900 626.000
# Calculating also the distance with "Manhattan"-method
dist_man <- dist(Boston, method = "manhattan")
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.016 149.100 279.500 342.900 509.700 1198.000
As we can see, the Manhattan method produces larger distances. Next we perform the k-means clustering and find the optimal number of clusters. Comments are in the code block.
library(ggplot2)
set.seed(123)
# First determining the maximum number of clusters to test
k_max <- 10
# Calculating the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# Visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# We see that 2 clusters is a good amount (the point where the line rapidly drops)
# K-means clustering
km <-kmeans(Boston, centers = 2)
# Plotting the Boston dataset with clusters, taking columns 6 to 10 for reference
pairs(Boston[6:10], col = km$cluster)
Here we see that the data is classified into two clusters and that the groups are quite separate from each other, apart from some overlapping points (for example in ‘rad’ and in the ‘dis’, ‘age’ and ‘rm’ panels).
This week’s exercise is about reducing the dimensionality of multidimensional data. The dataset originates from the United Nations Development Programme and, after wrangling, includes 155 observations (countries) and 8 variables. The original data covers indicators such as the HDI and its ranking, life expectancy, expected years of education, mean years of education, gross national income (GNI), the Gender Inequality Index (GII) and its ranking, maternal mortality, adolescent birth rate, female representation in parliament, secondary education and labor force participation for females and males, and the female/male ratios in secondary education and labor force participation.
The data description can be found here and the techniques for calculating each of these variables can be found here.
Graphical overview and data structure:
# Structure of the data
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
All variables are numeric. It is interesting that the maximum representation of females in parliament is 57.50 % while the minimum is zero… The next two graphs show the distributions of the variables and the correlations between them.
# Visualizing variables
ggpairs(human)
cor(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp GNI
## Edu2.FM 1.000000000 0.009564039 0.5760299 0.59325156 0.43030485
## Labo.FM 0.009564039 1.000000000 -0.1400125 0.04732183 -0.02173971
## Life.Exp 0.576029853 -0.140012504 1.0000000 0.78943917 0.62666411
## Edu.Exp 0.593251562 0.047321827 0.7894392 1.00000000 0.62433940
## GNI 0.430304846 -0.021739705 0.6266641 0.62433940 1.00000000
## Mat.Mor -0.660931770 0.240461075 -0.8571684 -0.73570257 -0.49516234
## Ado.Birth -0.529418415 0.120158862 -0.7291774 -0.70356489 -0.55656208
## Parli.F 0.078635285 0.250232608 0.1700863 0.20608156 0.08920818
## Mat.Mor Ado.Birth Parli.F
## Edu2.FM -0.6609318 -0.5294184 0.07863528
## Labo.FM 0.2404611 0.1201589 0.25023261
## Life.Exp -0.8571684 -0.7291774 0.17008631
## Edu.Exp -0.7357026 -0.7035649 0.20608156
## GNI -0.4951623 -0.5565621 0.08920818
## Mat.Mor 1.0000000 0.7586615 -0.08944000
## Ado.Birth 0.7586615 1.0000000 -0.07087810
## Parli.F -0.0894400 -0.0708781 1.00000000
# Creating a correlation matrix and visualizing the correlations
res1 <- cor.mtest(human, conf.level = .95)
cor(human) %>% corrplot(p.mat = res1$p, method = "color", type = "upper",
sig.level = c(.001, .01, .05), pch.cex = .9,
insig = "label_sig", pch.col = "white", order = "AOE")
Most of the variables are not normally distributed: there is positive skewness for example in maternal mortality, adolescent birth rate and GNI, and negative skewness in the labor force ratio and life expectancy. Quite many of the variables correlate with each other rather well, which is plausible given that the data and variables are well known and straightforward. The largest positive correlations are between maternal mortality and adolescent birth rate, and between life expectancy and expected years of education; the strongest negative correlation is between maternal mortality and life expectancy. As a single variable, adolescent birth rate (also referred to as the age-specific fertility rate for women aged 15–19) correlates with nearly all the others: the higher the fertility rate for women under 19, the lower the life expectancy and education, and the higher the risk of maternal mortality. From the correlation plot we can see that almost all correlations are statistically significant at the 95 % confidence level.
In this part of the analysis I will perform PCA first on the non-standardized data and then on the standardized data. PCA is an unsupervised statistical procedure and one of the dimensionality reduction techniques, which are used to reduce the “noise” and error in the data and to find the phenomenon of interest by focusing only on the essential dimensions. In PCA the data is first transformed to another ‘space’ with the same number of dimensions as the original data. The first principal component captures the maximum amount of variance in the original dataset, the second captures the largest share of what the first did not, the third what the second did not, and so on, yielding uncorrelated components. Therefore the first few principal components are usually sufficient to represent the data.
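The claim that the principal components are uncorrelated is easy to verify on any numeric data; a minimal sketch with the built-in mtcars data:

```r
# The PCA scores (pca$x) are pairwise uncorrelated: all off-diagonal
# correlations are numerically zero.
pca <- prcomp(mtcars, scale. = TRUE)
score_cor <- cor(pca$x)
max(abs(score_cor[upper.tri(score_cor)]))  # effectively zero
```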
First PCA on non standardized data:
# Performing PCA on non standardized human data
pca_human <- prcomp(human)
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
# Saving the summary for the axis labels
s <- summary(pca_human)
# Drawing a biplot of the PC1 and PC2
pca_pr <- round(100*s$importance[2, ], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
Without any kind of standardization, PCA performs quite badly, as we can see from the plot above. PCA projects the original data onto the directions of maximum variance; hence, without rescaling, PCA loads on the variables with the largest variances, and a single component ends up explaining essentially all of the variance (PC1 100 % vs. PC2 0 %). In the plot we see that the data has indeed been loaded almost entirely onto the first component and shows no meaningful structure. Only Qatar and a few other countries stand out. The arrows show the directions of the original variables, and an arrow’s length is proportional to the variable’s standard deviation; the GNI variable therefore stands out substantially, and its arrow is the only one pointing toward the second PC.
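The same effect can be demonstrated with synthetic data: when one column’s variance dwarfs the others, unscaled PCA loads almost entirely on it, just as GNI dominates above. This is a sketch with made-up columns, not the human data:

```r
set.seed(42)
# One small-variance and one huge-variance column, mimicking GNI's scale.
d <- data.frame(small = rnorm(100, sd = 1),
                large = rnorm(100, sd = 10000))
# Proportion of variance captured by PC1, without and with scaling.
unscaled <- summary(prcomp(d))$importance["Proportion of Variance", "PC1"]
scaled   <- summary(prcomp(d, scale. = TRUE))$importance["Proportion of Variance", "PC1"]
c(unscaled = unscaled, scaled = scaled)  # PC1 explains ~100 % unscaled, ~50 % scaled
```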
Next, PCA on the standardized data, where each variable has mean 0 and unit variance:
# Standardizing data
human_std <- scale(human)
# Performing PCA on standardized human data
pca_human <- prcomp(human_std)
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
# Print the summary
s <- summary(pca_human)
# Drawing a biplot of the PC1 and PC2
pca_pr <- round(100*s$importance[2, ], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
Here the data has been scaled, so the result no longer depends on the measurement scales of the variables and is easier to interpret. From the table we see that the variance is distributed much more evenly across the components. The arrows reflect the correlations between the variables: for example, expected years of education and life expectancy point in very similar directions, and their correlation is 0.79; both also correlate negatively with the arrows on the positive side of the x-axis. Maternal mortality and adolescent birth rate are likewise highly correlated with each other, but negatively correlated with expected education and life expectancy. The female/male labor force ratio and female representation in parliament stay close to zero on the x-axis, on both its negative and positive sides, implying rather neutral correlations with the other variables. The angle between observations and arrows reflects their correlation: judging from the country names, the least developed countries lie closer to the arrows for maternal mortality and adolescent birth rate, while more developed countries such as Switzerland lie near higher values of expected years of education. Based on the directions of the arrows, the variables point quite evenly to both dimensions.
Multiple correspondence analysis (MCA) is also a dimensionality reduction method, used to analyse several categorical variables.
First I will wrangle the data to our liking, keeping only certain columns of the ‘tea’ data. All the kept variables are categorical factor variables.
require(FactoMineR)
## Loading required package: FactoMineR
data("tea")
tea_time = tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
Visualization of the variables:
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar(aes(fill = "red")) + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
Next I’ll perform an MCA on the tea_time data:
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
From the summary table we can read the variances and the percentages of variance for each dimension. ‘Individuals’ shows the rows (here only the 10 first) with their coordinates, contributions and squared correlations, and ‘Categories’ the category names. Categories with v.test values above 1.96 or below -1.96 have coordinates that differ significantly from zero. The last box, categorical variables (eta2), shows the strength of the link between each variable and each dimension: the closer to 1, the stronger the link. The strongest links are between the lower-case ‘how’ and ‘where’ variables and Dim.1. Next is a plot of the MCA:
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
The categories of one variable share the same colour. Quite many categories lie close to each other, and the clearest outliers are ‘tea shop’ and ‘unpackaged’. From the previous table we can see that these categories belong to the ‘where’ and ‘how’ variables, so, as expected, they lie far from zero.
cats <- apply(tea_time, 2, function(x) nlevels(as.factor(x)))
# data frame with variable coordinates
mca1_vars_df = data.frame(mca$var$coord, Variable = rep(names(cats), cats))
# data frame with observation coordinates
mca1_obs_df = data.frame(mca$ind$coord)
# plot of variable categories
ggplot(data = mca1_obs_df, aes(x = Dim.1, y = Dim.2)) +
geom_hline(yintercept = 0, colour = "gray70") +
geom_vline(xintercept = 0, colour = "gray70") +
geom_point(colour = "gray50", alpha = 0.7) +
geom_density2d(colour = "gray80") +
geom_text(data = mca1_vars_df,
aes(x = Dim.1, y = Dim.2,
label = rownames(mca1_vars_df), colour = Variable)) +
ggtitle("MCA plot of variables using R package FactoMineR") +
scale_colour_discrete(name = "Variable")
From this plot we can see both observations and categories, as well as density curves to see where the observations are concentrated.
This week’s theme is longitudinal data, the name for data with repeated measures: if a response variable is measured under different conditions at several time points on the same individual/subject, the data is called longitudinal. Treating the observations as belonging to identified individuals helps in understanding the data, and this data type occurs most frequently in, for example, psychological testing, where a subject is tested repeatedly over a certain period. The data and literature used for this week’s exercise are produced by Kimmo Vehkalahti (Vehkalahti and Everitt, 2019), and I will explore the datasets with help from Chapters 8 & 9 of the book.
The first dataset is from a study by Crowder and Hand (1990), in which they examined the nutrition of three groups of rats that were put on different diets; each animal’s body weight in grams was recorded over a nine-week period (Vehkalahti and Everitt, 2019, Chapter 9). Here the column ‘rats’ contains the weight in grams.
The second dataset, BPRS, comes from a study by Davis (2002) in which 40 males were assigned to two different treatments and rated once a week for eight weeks on the brief psychiatric rating scale (BPRS), which assesses 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia (source: DataCamp).
These two datasets have been converted from wide form into longitudinal form.
Next I present the structure and summaries of the datasets:
# Reading the data
BPRS <- read.csv("~/Documents/GitHub/IODS-project/data/BPRSL.csv", sep = ",", header = TRUE, row.names = 1)
RATS <- read.csv("~/Documents/GitHub/IODS-project/data/RATSL.csv", sep = ",", header = TRUE, row.names = 1)
head(RATS)
## ID Group time rats Time
## 1 1 1 WD1 240 1
## 2 2 1 WD1 225 1
## 3 3 1 WD1 245 1
## 4 4 1 WD1 260 1
## 5 5 1 WD1 255 1
## 6 6 1 WD1 260 1
head(BPRS)
## treatment subject weeks bprs week
## 1 1 1 week0 42 0
## 2 1 2 week0 58 0
## 3 1 3 week0 54 0
## 4 1 4 week0 55 0
## 5 1 5 week0 72 0
## 6 1 6 week0 48 0
str(RATS)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group: int 1 1 1 1 1 1 1 1 2 2 ...
## $ time : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ rats : int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
str(BPRS)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
summary(RATS)
## ID Group time rats
## Min. : 1.00 Min. :1.00 WD1 :16 Min. :225.0
## 1st Qu.: 4.75 1st Qu.:1.00 WD15 :16 1st Qu.:267.0
## Median : 8.50 Median :1.50 WD22 :16 Median :344.5
## Mean : 8.50 Mean :1.75 WD29 :16 Mean :384.5
## 3rd Qu.:12.25 3rd Qu.:2.25 WD36 :16 3rd Qu.:511.2
## Max. :16.00 Max. :3.00 WD43 :16 Max. :628.0
## (Other):80
## Time
## Min. : 1.00
## 1st Qu.:15.00
## Median :36.00
## Mean :33.55
## 3rd Qu.:50.00
## Max. :64.00
##
summary(BPRS)
## treatment subject weeks bprs week
## Min. :1.0 Min. : 1.00 week0 : 40 Min. :18.00 Min. :0
## 1st Qu.:1.0 1st Qu.: 5.75 week1 : 40 1st Qu.:27.00 1st Qu.:2
## Median :1.5 Median :10.50 week2 : 40 Median :35.00 Median :4
## Mean :1.5 Mean :10.50 week3 : 40 Mean :37.66 Mean :4
## 3rd Qu.:2.0 3rd Qu.:15.25 week4 : 40 3rd Qu.:43.00 3rd Qu.:6
## Max. :2.0 Max. :20.00 week5 : 40 Max. :95.00 Max. :8
## (Other):120
First I will show a graphical display of the individual response profiles before and after standardization.
# Draw the plot RATS
ggplot(RATS, aes(x = Time, y = rats, linetype = as.factor(ID))) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATS$rats), max(RATS$rats)))
We see that in each group almost all individuals follow a similar, gradually rising curve. We can also note that there are big weight differences between the groups, increasing by group number, and that there is one significant outlier in the second group. Rats that are heavier at the beginning also tend to have higher weights at the end, a phenomenon called tracking (Vehkalahti and Everitt, 2019, Chapter 8). Tracking can be seen better from the standardized data below.
# Standardizing the weight values within each time point
RATSS <- RATS %>%
group_by(Time) %>%
mutate(stdrats = (rats - mean(rats))/sd(rats) ) %>%
ungroup()
# Plot
ggplot(RATSS, aes(x = Time, y = stdrats, linetype = as.factor(ID))) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized weight")
In order to understand the data better, it is useful to calculate mean profiles for each group and compare them with each other. The choice of summary measure depends on the question of interest. I will present simple summaries in the next two plots, calculating the mean and the standard error (to show variation) for each diet group:
# Calculating mean and standard error
# Length of time
RATS$Group <- factor(RATS$Group)
n <- RATS$Time %>% unique() %>% length()
# Summarizing
RATSsum <- RATS %>%
group_by(Group, Time) %>%
summarise( mean = mean(rats), se = sd(rats)/sqrt(n) ) %>%
ungroup()
# Plot the mean profiles
ggplot(RATSsum, aes(x = Time, y = mean, linetype = as.factor(Group), shape = as.factor(Group))) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
scale_y_continuous(name = "mean(rats) +/- se(rats)")
In the summary profiles we see that the groups are quite different from each other, with the greatest variation (standard error) in the second group, the highest values in the third group, and the lowest values with the smallest standard error in the first group. Clearly the diets made the rats fatter over time. Next, in order to compare the groups better, I will draw boxplots to spot outliers and remove them so that they won’t bias further comparisons.
# Summary
RATStime <- RATS %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(rats) ) %>%
ungroup()
ggplot(RATStime, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(rats), days")
We can see from the boxplots above that in each group there is one outlier. I will try to remove at least two of them and see what kind of difference it makes.
RATStime1 <- filter(RATStime, mean > 250 & mean < 550)
ggplot(RATStime1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(rats), days")
I removed two of the outliers, which decreased the variance within the groups, although the overall picture did not change much. Next I will test whether the group means differ. In the DataCamp exercise a two-sample t-test was performed, but since there are three groups here, ANOVA suits best.
res.aov <- aov(mean ~ Group, data = RATStime1)
# Summary of the analysis
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 189317 94659 611.4 5.32e-12 ***
## Residuals 11 1703 155
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Not surprisingly, there is a significant difference between the group means. This is still a rather shallow, simple graphical analysis of the observations and would require deeper digging.
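One way to dig deeper is a post-hoc test showing which group pairs differ. A sketch with TukeyHSD on hypothetical stand-in data (made-up numbers shaped like RATStime1, not the real rat weights):

```r
set.seed(7)
# Hypothetical stand-in for RATStime1: three groups of per-rat mean weights.
toy <- data.frame(
  Group = factor(rep(c("1", "2", "3"), each = 5)),
  mean  = c(rnorm(5, 260, 10), rnorm(5, 450, 10), rnorm(5, 520, 10))
)
toy_aov <- aov(mean ~ Group, data = toy)
# TukeyHSD reports each pairwise group difference with an adjusted p-value.
TukeyHSD(toy_aov)
```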
ggplot(BPRS, aes(x = week, y = bprs, linetype = as.factor(subject))) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRS$bprs), max(BPRS$bprs)))
Next I will fit a regression model by still ignoring the repeated-measures structure:
# create a regression model BPRS_reg
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRS)
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.8817 2.2949 19.993 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
We can see that the treatment group has no statistically significant effect, whereas the time variable ‘week’ does. Next, a more formal analysis is to fit a random intercept model with treatment and week as explanatory variables. This allows the linear regression fit for each man to differ in intercept from the other men, by introducing a subject-specific random effect.
# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRS, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 45.8817 2.4413 18.794
## week -2.2704 0.2084 -10.896
## treatment 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.341
## treatment -0.661 0.000
Results of the random intercept model are above. We can see, for example, that the standard deviation of the subject-level intercepts is 6.885. Finally, I will fit a random intercept and random slope model. It allows the linear regression fits for the individuals to differ not only in intercept but also in slope. This way it is possible to account for the individual differences in the men's bprs values, but also for the effect of time (Source: DataCamp). So in the next summary we see both the between-subject differences in bprs and how those differences change over time.
# create a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS, REML = FALSE)
#Summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8201 8.0511
## week 0.9608 0.9802 -0.51
## Residual 97.4307 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 45.8817 2.5685 17.864
## week -2.2704 0.2977 -7.626
## treatment 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.477
## treatment -0.608 0.000
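Before the formal comparison, the random intercept and slope model can also be inspected visually. A minimal sketch, assuming the `BPRS` data frame, the fitted `BPRS_ref1` object from above, and ggplot2 loaded, plots the subject-specific fitted trajectories in the same style as the observed profiles:

```r
# Sketch: fitted bprs trajectories from the random intercept and slope
# model (assumes BPRS and BPRS_ref1 exist as created above)
library(ggplot2)

# Subject-specific fitted values from the mixed model
BPRS$Fitted <- fitted(BPRS_ref1)

ggplot(BPRS, aes(x = week, y = Fitted, linetype = as.factor(subject))) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times = 4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") +
  scale_y_continuous(name = "Fitted bprs")
```

If the random slopes matter, the fitted lines should fan out at different rates across subjects rather than running parallel.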
# ANOVA test
anova(BPRS_ref1, BPRS_ref)
## Data: BPRS
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA test results we can see that the p-value of the likelihood-ratio test between BPRS_ref (random intercept) and BPRS_ref1 (random intercept and slope) is 0.026, which is significant at the 5 % level. The lower the p-value, the stronger the evidence in favour of the more complex model, so the random intercept and slope model fits these data better.
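The likelihood-ratio test that `anova()` performs can also be reproduced by hand from the deviances reported above: the test statistic is the difference in deviances, and under the null hypothesis it follows a chi-squared distribution with degrees of freedom equal to the difference in parameter counts (7 - 5 = 2). A minimal sketch, using the numbers from the output:

```r
# Likelihood-ratio test by hand, using the deviances from the summaries:
# deviance(BPRS_ref) = 2738.7, deviance(BPRS_ref1) = 2731.4
chisq_stat <- 2738.7 - 2731.4   # difference in deviances (Chisq in anova output)
df_diff <- 7 - 5                # difference in number of estimated parameters

# Upper-tail chi-squared probability gives the p-value
p_value <- pchisq(chisq_stat, df = df_diff, lower.tail = FALSE)
p_value  # approximately 0.026, matching the anova() output
```

This makes explicit why `anova()` reports `Chisq = 7.2721`, `Chi Df = 2` and `Pr(>Chisq) = 0.02636` for this model pair.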